160
13
Useful Algorithms
13.2
Botryology
The term “botryology”, apparently coined by I. J. Good (1962), was introduced at
a time when the task of finding clusters was generally focused on objects arranged
in ordinary (Euclidean) space (e.g., stars clustered into galaxies). It signifies a more
general approach to finding clusters or clumps, concerned with logical and qualita-
tive relationships, chosen for their relevance to the matter in hand, rather than with
ordinary distance (or a metric satisfying a triangle inequality; see below). Possibly
relevance could be defined according to success in finding clusters or clumps, hence
permitting iterative refinement of the definition.
Since then the notion of clustering has anyway been somewhat generalized,
and typically now includes any process whereby relevance can lead to a numeri-
cal attribute (e.g., the conditional probability of use of an object). The objects are
nodes on a graph (Sect. 12.2), and the links between them (edges) give the relevance.
Thus, an elementa Subscript i jai j of the adjacency matrixupper AA gives the relevance ofii toj j. This may
not be the same as a Subscript j ia ji, giving the relevance of j j to ii; hence, the graph is a directed
one. On the other hand, association factors such asupper P left parenthesis i j right parenthesis divided by upper P left parenthesis i right parenthesis upper P left parenthesis j right parenthesisP(i j)/P(i)P( j) (the probability
of the joint occurrences divided by the product of the probabilities of the separate
occurrences) are symmetrical. The degree of clumpiness of a group of nodes could
then be given by summing the elements of the adjacency matrix of the group and
dividing by the number of elements in the group; a clump could be considered as
complete if the addition of an extra node would bring the clumpiness below some
threshold.
Possibly, it is useful to use the term “clustering” for the formal process (which
can be carried out on a computer) described in Sect. 13.2.1 and the term “clumping”
for a more general process (of which clustering would be a subset), for which formal
definitions might not always be available.
It is possible to conceive a highly automated mode of scientific investigation, in
which every object in the universe would be parametrized (by which it is meant that
a numerical value is assigned to every attribute). In order to investigate something
more specifically, the researcher would select the relevant collection of objects (e.g.,
“furry mammals”) and apply some kind of dimensional reduction to the dataset (if
the attributes were chosen from some vast standardized set, many would, of course,
have values of zero for a particular collection), preferably down to two or three, 3
after which a clustering algorithm would be applied. 4
13.2.1
Clustering
While supervised pattern recognition (i.e., with a teacher) corresponds to the most
familiar kind of pattern recognition carried out by human beings throughout their
3 For example, using principal component analysis (PCA) (q.v.).
4 See Gorban et al. (2005) for an example.